Analysis of COVID-19 in Utah
Overview of COVID-19 Data in Utah by County
Analysis Objective:
The objective is to examine the relationships between the demographics of Utah counties and the number of COVID-19 cases, the number of deaths, and the prevalence of mask wearing in those counties.
Select Data:
All data points are from Utah and cover all of its counties. The data includes various demographics (median age, population, race) for each county, plus COVID-19 data: case numbers, death numbers, and the share of reported mask wearing in each county, split into never, rarely, sometimes, frequently, and always. In total, there are 7317 observations of 52 variables.
Data Analysis:
Unsupervised Analysis: K-means Clustering
COVID-19 Cases and Prevalence of Mask Wearing by County - Overview
##     cases NEVER RARELY SOMETIMES FREQUENTLY ALWAYS
## 311     1 0.028  0.032     0.094      0.202  0.644
## 331     1 0.028  0.032     0.094      0.202  0.644
## 351     1 0.028  0.032     0.094      0.202  0.644
## 372     1 0.028  0.032     0.094      0.202  0.644
## 394     1 0.028  0.032     0.094      0.202  0.644
## 425     1 0.028  0.032     0.094      0.202  0.644
Summary of Covid Data
# Unsupervised analysis
# Summary after removing NAs
summary(complete_Covid)
##      cases           NEVER             RARELY          SOMETIMES
##  Min.   :    1   Min.   :0.00200   Min.   :0.02300   Min.   :0.0330
##  1st Qu.:   14   1st Qu.:0.04000   1st Qu.:0.04300   1st Qu.:0.0720
##  Median :  103   Median :0.06800   Median :0.06400   Median :0.0980
##  Mean   : 1703   Mean   :0.09171   Mean   :0.09169   Mean   :0.1162
##  3rd Qu.:  650   3rd Qu.:0.09900   3rd Qu.:0.11400   3rd Qu.:0.1410
##  Max.   :74269   Max.   :0.43200   Max.   :0.29600   Max.   :0.2710
##    FREQUENTLY         ALWAYS
##  Min.   :0.1710   Min.   :0.1750
##  1st Qu.:0.2120   1st Qu.:0.3530
##  Median :0.2690   Median :0.4210
##  Mean   :0.2691   Mean   :0.4312
##  3rd Qu.:0.3000   3rd Qu.:0.5180
##  Max.   :0.4690   Max.   :0.6510
Plotting to find the best number of clusters for the analysis
The elbow bend is at 5, so I created groupings based on 5 clusters.
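As a rough illustration of how an elbow can be read, the sketch below assumes the plotted curve was the total within-cluster sum of squares against the number of clusters k. The `toy` data frame is synthetic and only stands in for `complete_Covid`; the range of k is also an assumption.

```r
# Synthetic stand-in for complete_Covid (hypothetical values, three loose groups)
set.seed(42)
toy <- data.frame(
  cases  = c(rnorm(50, 100, 20), rnorm(50, 5000, 500), rnorm(50, 20000, 1500)),
  ALWAYS = c(rnorm(50, 0.40, 0.03), rnorm(50, 0.50, 0.03), rnorm(50, 0.60, 0.03))
)

# Total within-cluster sum of squares for k = 1..8
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is the k after which the curve stops dropping sharply
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On the real data, the same loop would be run on `complete_Covid` and the bend read off the plot.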

I created 5 clusters with 25 random starting assignments.
To get a good overall view of the clustering, I aggregated the results to show the means for each cluster.
I did not include the original clustering assignments, because the vector of integers indicating the cluster assignment for each data point was very large.
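# Fit k-means with 5 clusters, keeping the best of 25 random starts
CClusters <- kmeans(complete_Covid, centers = 5, nstart = 25)
# Mean of every variable within each cluster
aggregate(complete_Covid, by = list(cluster = CClusters$cluster), FUN = mean)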
##   cluster      cases      NEVER     RARELY SOMETIMES FREQUENTLY    ALWAYS
## 1       1   364.0402 0.09606134 0.09584397 0.1163072  0.2706973 0.4210397
## 2       2 37829.0492 0.04373770 0.03318033 0.1337377  0.2185246 0.5708197
## 3       3 21819.3361 0.04227049 0.03362295 0.1291803  0.2195574 0.5753443
## 4       4  6778.7797 0.05196203 0.05896456 0.1097646  0.2715291 0.5074532
## 5       5 61329.0870 0.02800000 0.03200000 0.0940000  0.2020000 0.6440000
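The cluster means above differ mostly in cases, which run into the tens of thousands while the mask-wearing shares lie between 0 and 1. Because k-means uses Euclidean distance, a variable on a much larger scale tends to dominate the assignments. One possible variation, which the analysis above did not use, is to standardize the columns before clustering. The sketch below uses a synthetic `toy` data frame as a stand-in for `complete_Covid`.

```r
# Synthetic stand-in for complete_Covid (hypothetical values, two groups)
set.seed(1)
toy <- data.frame(
  cases  = c(rnorm(40, 300, 50), rnorm(40, 30000, 3000)),
  ALWAYS = c(rnorm(40, 0.40, 0.02), rnorm(40, 0.60, 0.02))
)

# Centre each column to mean 0 and rescale to standard deviation 1
scaled <- scale(toy)

# Cluster on the standardized values so no single column dominates the distance
scaled_fit <- kmeans(scaled, centers = 2, nstart = 25)

# Report cluster means on the original scale so they stay interpretable
aggregate(toy, by = list(cluster = scaled_fit$cluster), FUN = mean)
```

Whether standardizing is appropriate depends on the question: if the goal is to group counties purely by case volume, the unscaled clustering already does that.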
At this point, I included a small overview of each cluster and the counties that belong to it.
Supervised: Estimation - Regression to explain the variability in COVID-19 cases
In this type of model, I wanted to include more variables in addition to the COVID-19 data. I also included demographic data for all Utah counties, including race, income, population, smoking, obesity, diabetes, median age, and uninsured rates.
I then created a linear regression model to be able to see the most significant variables in the data.
The residuals range from -9096.4 to 21112.7 around a median of -6.7, so this model needs to be explored further in order to bring the residuals more in line with one another. I ran a few variations on this model before settling on the current one.
The Multiple R-squared (0.9063) and Adjusted R-squared (0.906) are both around 0.91, which indicates that the model fits the data closely.
The F-statistic is also high (2362 on 22 and 5370 degrees of freedom) and the p-value is low (< 2.2e-16).
The model explains about 91% of the variability in cases of COVID-19.
##
## Call:
## lm(formula = cases ~ ., data = int_small_decision_tree)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9096.4  -217.2    -6.7   104.7 21112.7
##
## Coefficients: (1 not defined because of singularities)
##                                                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                                -1.207e+06  2.825e+05  -4.273 1.96e-05 ***
## deaths                                                      1.654e+02  1.007e+00 164.270  < 2e-16 ***
## NEVER                                                       8.300e+05  7.648e+05   1.085   0.2778
## RARELY                                                      8.657e+05  7.740e+05   1.118   0.2634
## SOMETIMES                                                   7.968e+05  7.399e+05   1.077   0.2815
## FREQUENTLY                                                  8.006e+05  7.600e+05   1.053   0.2922
## ALWAYS                                                      8.204e+05  7.641e+05   1.074   0.2830
## Less.Than.High.School                                      -8.917e+02  9.436e+02  -0.945   0.3447
## At.Least.High.School.Diploma                                       NA         NA      NA       NA
## At.Least.Bachelor.s.Degree                                  5.179e+02  1.204e+02   4.303 1.72e-05 ***
## Graduate.Degree                                            -4.668e+02  7.841e+01  -5.954 2.79e-09 ***
## School.Enrollment                                          -4.433e+01  2.075e+01  -2.137   0.0327 *
## Median.Earnings.2010.dollars                               -5.116e-01  3.356e-01  -1.524   0.1275
## White.not.Latino.Population                                 3.969e+03  7.026e+03   0.565   0.5722
## African.American.Population                                 7.727e+03  9.575e+03   0.807   0.4197
## Native.American.Population                                  4.060e+03  6.926e+03   0.586   0.5577
## Asian.American.Population                                  -1.869e+03  7.427e+03  -0.252   0.8013
## Population.some.other.race.or.races                         4.464e+03  5.625e+03   0.794   0.4274
## Latino.Population                                           3.957e+03  6.855e+03   0.577   0.5638
## Total.Population                                            7.104e-03  1.656e-03   4.290 1.82e-05 ***
## Construction.extraction.maintenance.and.repair.occupations  4.022e+01  1.488e+02   0.270   0.7869
## median_age                                                  1.387e+02  2.770e+02   0.501   0.6165
## Adult.smoking                                               9.806e+03  2.066e+04   0.475   0.6351
## Adult.obesity                                               1.940e+04  8.064e+03   2.406   0.0162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2039 on 5370 degrees of freedom
## Multiple R-squared: 0.9063, Adjusted R-squared: 0.906
## F-statistic: 2362 on 22 and 5370 DF, p-value: < 2.2e-16
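The output above reports one coefficient as NA "because of singularities". R does this when a predictor is an exact linear combination of other predictors. One plausible cause here, which is an assumption and not confirmed by the data, is that At.Least.High.School.Diploma is the complement of Less.Than.High.School. The sketch below reproduces that situation on synthetic data and also shows where the R-squared values and F-statistic live in the summary object. All column names and values in `toy` are hypothetical.

```r
# Synthetic data with one exactly redundant predictor (hypothetical columns)
set.seed(7)
n <- 200
toy <- data.frame(
  deaths       = rpois(n, 20),
  less_than_hs = runif(n, 2, 15)
)
# Complement of less_than_hs, so it carries no extra information
toy$at_least_hs <- 100 - toy$less_than_hs
toy$cases <- 150 * toy$deaths - 30 * toy$less_than_hs + rnorm(n, 0, 200)

fit <- lm(cases ~ ., data = toy)
s   <- summary(fit)

coef(fit)            # at_least_hs comes back as NA
alias(fit)$Complete  # shows which term is a combination of the others

s$r.squared          # Multiple R-squared
s$adj.r.squared      # Adjusted R-squared
s$fstatistic         # F value, numerator df, denominator df
```

Dropping the redundant column before fitting gives the same fit without the NA row.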
Decision Tree - Regression Tree
I wanted to find local effects of the data and improve the model by creating a Decision Tree. The data is broken up by county in each of the models, so I can see how cases are impacted by each variable.
I knew from the linear model that deaths was one of the important variables, which is not that surprising. I also wanted to see what else would show up in the Decision Tree model.
I can see that all of the counties of Utah are first grouped based on number of deaths. From there the cases are divided further by number of deaths, and then they begin to be influenced by demographic variables that were not necessarily marked as important in the regression model: the one demographic split in this tree is on Less.Than.High.School, which was not significant in the regression.
## n= 4313
##
## node), split, n, deviance, yval
##       * denotes terminal node
##
##  1) root 4313 168233800000 1977.8020
##    2) deaths< 65.5 4139 18590430000 934.7949
##      4) deaths< 30.5 3921 3487652000 538.8059
##        8) deaths< 5.5 3268 420548600 236.7583 *
##        9) deaths>=5.5 653 1276842000 2050.4320 *
##      5) deaths>=30.5 218 3429258000 8057.1470 *
##    3) deaths>=65.5 174 38034090000 26788.1800
##      6) deaths< 270.5 140 12725850000 21324.3600
##       12) Less.Than.High.School>=8 106 7987302000 18537.6300
##         24) deaths< 139 40 519388700 9044.7250 *
##         25) deaths>=139 66 1678689000 24290.9100 *
##       13) Less.Than.High.School< 8 34 1348991000 30012.3800 *
##      7) deaths>=270.5 34 3919152000 49286.2900
##       14) deaths< 316 21 461526400 42037.9000 *
##       15) deaths>=316 13 572013800 60995.2300 *
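The printed tree has the layout that the rpart package produces for a numeric response, so the sketch below assumes rpart with `method = "anova"`. The `toy` data frame, its column names, and the effect sizes are synthetic and hypothetical.

```r
library(rpart)

# Synthetic stand-in data: cases driven mostly by deaths, with an education effect
set.seed(3)
n <- 500
toy <- data.frame(
  deaths       = rpois(n, 30),
  less_than_hs = runif(n, 2, 15)
)
toy$cases <- 100 * toy$deaths +
  ifelse(toy$less_than_hs < 8, 1500, 0) +
  rnorm(n, 0, 100)

# method = "anova" fits a regression tree, where yval is the mean response in a node
tree <- rpart(cases ~ ., data = toy, method = "anova")

print(tree)               # columns: node), split, n, deviance, yval
tree$variable.importance  # how much each variable contributed to the splits
```

Each row of the printout reads as: node number, split rule, number of observations, deviance (sum of squared differences from the node mean), and the predicted value for that node.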

Apply Analysis
According to the linear regression model, the statistically significant variables were deaths, Education: At least Bachelor's Degree, Education: Graduate Degree, School Enrollment, Total Population, and Adult obesity. Apart from deaths, Education accounted for the most significant terms in this particular model.
I also built a regression tree so that I could see the interaction points within the models. Through this tree model, I could see interactions that were not seen in the regression model.
Combining what these models have shown, I can see which data points matter most for understanding the influence of demographics and mask use on the number of cases in different counties in Utah. Not only can I see the importance of looking into different demographics, seeing how COVID-19 is affecting those populations, and using this information to better serve them, I can also see how reported mask use relates to the number of cases in each county.
Deploy Model
I can use the cluster groups, the significant variables from the regression, and the significant breakpoints from the Decision Tree to create an action plan for outreach within the community.
Assess Results
These models definitely need to be re-evaluated to focus on the most important variables. Even though the regression model had a high R-squared, I believe that an alternative model that removes the ‘deaths’ variable might surface a different set of variables to focus on when analyzing cases of COVID-19 in Utah.
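One way to try the alternative described above is to refit the same model with the deaths term dropped and compare the two fits. This is only a sketch on synthetic data: `toy`, its columns, and the coefficients used to generate it are hypothetical, and the real comparison would use the data frame passed to the original `lm()` call.

```r
# Synthetic stand-in data (hypothetical columns and effect sizes)
set.seed(11)
n <- 300
toy <- data.frame(
  deaths           = rpois(n, 25),
  total_population = round(runif(n, 1e3, 1e6)),
  adult_obesity    = runif(n, 0.15, 0.35)
)
toy$cases <- 160 * toy$deaths + 0.007 * toy$total_population + rnorm(n, 0, 500)

# Full model, then the same model with deaths removed
full      <- lm(cases ~ ., data = toy)
no_deaths <- update(full, . ~ . - deaths)

# Compare how much variability each version explains
summary(full)$adj.r.squared
summary(no_deaths)$adj.r.squared

# See which predictors stand out once deaths is gone
summary(no_deaths)$coefficients
```

A drop in adjusted R-squared is expected when a strong predictor is removed; the point of the refit is to see which remaining variables become informative.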
Strengths of K-means Clustering, Decision Tree, and Regression Analysis
Using a combination of these models allows the model builder to check and change how the model is built. Looking at the different clusters illustrates how the data can be fit into separate categories. From there, I can build the regression model, see the significance of each separate variable, and attempt to create a better model from the significant variables.
Finally, by using that model to build a Decision Tree, I can see how the variables interact and their breakpoints. Given this information that was not specified in the regression model, I can go back and alter the regression model to reflect those newly discovered interactions.
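As a hedged illustration of feeding a tree breakpoint back into the regression, the sketch below turns the first split from the tree output (deaths >= 65.5) into an indicator and lets the education effect differ on either side of it. The data is synthetic, the column names are hypothetical, and the choice of which interaction to add is an assumption made for the example.

```r
# Synthetic stand-in data in which education only matters above the breakpoint
set.seed(5)
n <- 400
toy <- data.frame(
  deaths       = round(runif(n, 0, 150)),
  less_than_hs = runif(n, 2, 15)
)
# Indicator for the tree's first split (deaths >= 65.5 in the output above)
toy$high_deaths <- toy$deaths >= 65.5
toy$cases <- 150 * toy$deaths -
  ifelse(toy$high_deaths, 800 * toy$less_than_hs, 0) +
  rnorm(n, 0, 500)

# Additive model versus a model with the tree-suggested interaction
additive <- lm(cases ~ deaths + less_than_hs, data = toy)
interact <- lm(cases ~ deaths + less_than_hs * high_deaths, data = toy)

# F-test for whether the extra terms improve the fit
anova(additive, interact)
```

If the interaction terms are not significant on the real data, the simpler additive model is the better choice.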
Using a combination of all of these model types allows the model builder to plan, build, discover, and then plan, build, discover again until a more effective model is ready to be operationalized.